Frequent Itemsets

Question 1

Suppose we have transactions that satisfy the following assumptions:

  • s, the support threshold, is 10,000.
  • There are one million items, which are represented by the integers 0,1,...,999999.
  • There are N frequent items, that is, items that occur 10,000 times or more.
  • There are one million pairs that occur 10,000 times or more.
  • There are 2M pairs that occur exactly once. M of these pairs consist of two frequent items, the other M each have at least one nonfrequent item.
  • No other pairs occur at all.
  • Integers are always represented by 4 bytes.

Suppose we run the a-priori algorithm to find frequent pairs and can choose on the second pass between the triangular-matrix method for counting candidate pairs (a triangular array count[i][j] that holds an integer count for each pair of items (i, j) where i < j) and a hash table of item-item-count triples. Neglect in the first case the space needed to translate between original item numbers and numbers for the frequent items, and in the second case neglect the space needed for the hash table. Assume that item numbers and counts are always 4-byte integers.

As a function of N and M, what is the minimum number of bytes of main memory needed to execute the a-priori algorithm on this data? Demonstrate that you have the correct formula by selecting, from the choices below, the triple consisting of values for N, M, and the (approximate, i.e., to within 10%) minumum number of bytes of main memory, S, needed for the a-priori algorithm to execute with this data.


In [1]:
import os
import sys

# N = 100,000; M = 50,000,000; S = 5,000,000,000
# N = 40,000; M = 60,000,000; S = 3,200,000,000
# N = 50,000; M = 80,000,000; S = 1,500,000,000
# N = 100,000; M = 100,000,000; S = 1,200,000,000
soln = [[100000, 50000000, 5000000000],
        [40000, 60000000, 3200000000],
        [50000, 80000000, 1500000000],
        [100000, 100000000, 1200000000]]

In [2]:
def a(N, M):
    memory = min(12*(M + 10**6), 4*N+2*N*(N-1))
    #memory = min(12*(M + 10**6), 4*N*N)
    return memory

In [3]:
def pct(e, a):
    return 100 * (abs(e - a) / float(a))

In [4]:
for n, m, s in soln:
    print pct(s, a(n, m))


716.993464052
337.158469945
54.3209876543
0.990099009901

Answer: N = 100,000; M = 100,000,000; S = 1,200,000,000

Question 2

Imagine there are 100 baskets, numbered 1,2,...,100, and 100 items, similarly numbered. Item i is in basket j if and only if i divides j evenly. For example, basket 24 is the set of items {1,2,3,4,6,8,12,24}. Describe all the association rules that have 100% confidence. Which of the following rules has 100% confidence?


In [5]:
baskets = range(1,101)
items = range(1,101)

# Create transactions
transactions = []

for i in baskets:
    basket = []
    for item in items:
        if i % item == 0:
            basket.append(item)
    transactions.append(basket)

In [6]:
def check(transactions,query):
    count=0
    for t in transactions:
        query_in = True
        for q in query:
            if q not in t:
                query_in = False
        if query_in:
            count+=1
    return count

In [7]:
def confidence(num,denom):
    count_denom = check(transactions,denom)
    count_num = check(transactions,num)
    confidence = count_num /(1.0* count_denom ) * 100
    return confidence

In [8]:
print "{1,2}-> 4,Condidence = %d"%(confidence([1,2,4],[1,2]))   
print "{1}-> 2,Condidence = %d"%(confidence([1,2],[1]))   
print "{1,4,7}-> 14,Condidence = %d"%(confidence([1,4,7,14],[1,4,7]))   
print "{1,3,6}-> 12,Condidence = %d"%(confidence([1,3,6,12],[1,3,6]))   
print "{4,6}-> 12,Condidence = %d"%(confidence([4,6,12],[4,6]))   
print "{8,12}-> 96,Condidence = %d"%(confidence([8,12,96],[8,12]))   
print "{4,6}-> 24,Condidence = %d"%(confidence([4,6,24],[4,6]))   
print "{1,3,6}-> 12,Condidence = %d"%(confidence([1,3,6,12],[1,3,6]))


{1,2}-> 4,Condidence = 50
{1}-> 2,Condidence = 50
{1,4,7}-> 14,Condidence = 100
{1,3,6}-> 12,Condidence = 50
{4,6}-> 12,Condidence = 100
{8,12}-> 96,Condidence = 25
{4,6}-> 24,Condidence = 50
{1,3,6}-> 12,Condidence = 50

Question 3

Suppose we perform the PCY algorithm to find frequent pairs, with market-basket data meeting the following specifications:

  • s, the support threshold, is 10,000.
  • There are one million items, which are represented by the integers 0,1,...,999999.
  • There are 250,000 frequent items, that is, items that occur 10,000 times or more.
  • There are one million pairs that occur 10,000 times or more.
  • There are P pairs that occur exactly once and consist of 2 frequent items.
  • No other pairs occur at all.
  • Integers are always represented by 4 bytes.
  • When we hash pairs, they distribute among buckets randomly, but as evenly as possible; i.e., you may assume that each bucket gets exactly its fair share of the P pairs that occur once.

Suppose there are S bytes of main memory. In order to run the PCY algorithm successfully, the number of buckets must be sufficiently large that most buckets are not large. In addition, on the second pass, there must be enough room to count all the candidate pairs. As a function of S, what is the largest value of P for which we can successfully run the PCY algorithm on this data? Demonstrate that you have the correct formula by indicating which of the following is a value for S and a value for P that is approximately (i.e., to within 10%) the largest possible value of P for that S.


In [9]:
import os
import sys

In [10]:
# S = 1,000,000,000; P = 10,000,000,000
# S = 300,000,000; P = 750,000,000
# S = 1,000,000,000; P = 20,000,000,000 (correct)
# S = 100,000,000; P = 120,000,000

soln = [[1000000000, 10000000000],
        [300000000, 750000000],
        [1000000000, 20000000000],
        [100000000, 120000000]]

In [11]:
def b(S, P):
    memory = S*S/48000000 - P
    return memory

In [12]:
def pct(e, a):
    return 100 * (abs(e - a) / float(a))
    return 100 * (a/(e*e/48000000))

In [13]:
for s, p in soln:
    print "Difference between P and S^2/48,000,000 = %d" % b(s, p)
    print "Percentage of P to the largest possible P value for that S = %d" % pct(s,p)
    print


Difference between P and S^2/48,000,000 = 10833333333
Percentage of P to the largest possible P value for that S = 90

Difference between P and S^2/48,000,000 = 1125000000
Percentage of P to the largest possible P value for that S = 60

Difference between P and S^2/48,000,000 = 833333333
Percentage of P to the largest possible P value for that S = 95

Difference between P and S^2/48,000,000 = 88333333
Percentage of P to the largest possible P value for that S = 16

Answer: Largest P value that is < S^2/48000000, within 10% to the largest possible value of P for that S.


In [ ]: